
WORLD INTELLECTUAL PROPERTY ORGANIZATION

International Bureau

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)



(51) International Patent Classification:
G06F 9/38



A2 



(11) International Publication Number: WO 93/20505 

(43) International Publication Date: 14 October 1993 (14.10.93) 



(21) International Application Number: PCT/JP93/00375

(22) International Filing Date: 26 March 1993 (26.03.93)



(30) Priority data:
07/860,719    31 March 1992 (31.03.92)    US



(71) Applicant: SEIKO EPSON CORPORATION [JP/JP]; 4-1,
Nishi-Shinjuku 2-chome, Shinjuku-ku, Tokyo 163 (JP).

(72) Inventors: GARG, Sanjiv ; 46820 Sentinel Drive, Fremont,
CA 94539 (US). IADONATO, Kevin, Ray ; 678 Galleon
Court, San Jose, CA 95133 (US).

(74) Agents: SUZUKI, Kisaburo et al.; Seiko Epson Corporation,
4-1, Nishi-Shinjuku 2-chome, Shinjuku-ku, Tokyo
163 (JP).



(81) Designated States: JP, KR, European patent (AT, BE, CH, 
DE, DK, ES, FR, GB, GR, IE, IT, LU, MC, NL, PT, 
SE). 



Published 

Without international search report and to be republished 
upon receipt of that report. 



(54) Title: SUPERSCALAR RISC INSTRUCTION SCHEDULING 



(57) Abstract 

A register renaming system for out-of-order execution of a
set of reduced instruction set computer instructions having
addressable source and destination register fields, adapted for use in a
computer having an instruction execution unit with a register file
accessed by read address ports and for storing instruction operands.
A data dependence check circuit is included for determining
data dependencies between the instructions. A tag assignment
circuit generates one or more tags to specify the location of
operands, based on the data dependencies determined by the data
dependence check circuit. A set of register file port multiplexers
select the tags generated by the tag assignment circuit and pass the
tags onto the read address ports of the register file for storing
execution results.

DESCRIPTION
SUPERSCALAR RISC INSTRUCTION SCHEDULING 



CROSS-REFERENCE TO RELATED APPLICATIONS

The following are commonly owned, co-pending applications:
* "Semiconductor Floor Plan and Method for a Register Renaming Circuit",
Serial No. 07/860,718, filed 3/31/92, concurrently filed with the present
application (Attorney Docket No. SP041);
* "High Performance RISC Microprocessor Architecture", Serial No.
07/817,810, filed 1/8/92 (Attorney Docket No. SP015);
* "Extensible RISC Microprocessor Architecture", Serial No. 07/817,809, filed
1/8/92 (Attorney Docket No. SP021).
The disclosures of the above applications are incorporated herein by reference.

BACKGROUND OF THE INVENTION 


1. Field of the Invention 

The present invention relates to superscalar reduced instruction set computers
(RISC). More particularly, the present invention relates to instruction scheduling,
including register renaming and instruction issuing, for superscalar RISC computers.

2. Related Art 

A more detailed description of some of the basic concepts discussed in this
application is found in a number of references, including Mike Johnson, "Superscalar
Microprocessor Design" (Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1991);
John L. Hennessy et al., "Computer Architecture - A Quantitative Approach"
(Morgan Kaufmann Publishers, Inc., San Mateo, California, 1990). Johnson's text,
particularly Chapters 2, 6 and 7, provides an excellent discussion of the register
renaming issues addressed by the present invention.

A major consideration in a superscalar RISC processor is how to execute
multiple instructions in parallel and out-of-order, without incurring data errors due to
dependencies inherent in such execution. Data dependency checking, register
renaming and instruction scheduling are integral aspects of the solution.

2.1 Storage Conflicts and Register Renaming 

True dependencies (sometimes called "flow dependencies" or "write-read"
dependencies) are often grouped with anti-dependencies (also called "read-write"
dependencies) and output dependencies (also called "write-write" dependencies) into a
single group of instruction dependencies. The reason for this grouping is that each of
these dependencies manifests itself through use of registers or other storage
locations. However, it is important to distinguish true dependencies from the other
two. True dependencies represent the flow of data and information through a
program. Anti- and output dependencies arise because, at different points in time,
registers or other storage locations hold different values for different computations.
When instructions are issued in order and complete in order, there is a one-to-one
correspondence between registers and values. At any given point in execution, a
register identifier precisely identifies the value contained in the corresponding
register. When instructions are issued out of order and complete out of order,
correspondence between registers and values breaks down, and values conflict for
registers. This problem is severe when the goal of register allocation is to keep as
many values in as few registers as possible. Keeping a large number of values in a
small number of registers creates a large number of conflicts when the execution
order is changed from the order assumed by the register allocator.

Anti- and output dependencies are more properly called "storage conflicts"
because reusing storage locations (including registers) causes instructions to
interfere with one another even though conflicting instructions are otherwise
independent. Storage conflicts constrain instruction issue and reduce performance.
But storage conflicts, like other resource conflicts, can be reduced or eliminated by
duplicating the troublesome resource.
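The three kinds of dependency described above can be illustrated with a minimal Python sketch. The `Instr` type, the `classify` function and the three sample instructions are hypothetical names introduced only for illustration; the sketch assumes each instruction writes one register and reads a small set of registers.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str                 # register written by the instruction
    srcs: Tuple[str, ...]     # registers read by the instruction


def classify(earlier: Instr, later: Instr) -> Set[str]:
    """Return the dependencies of `later` on `earlier` (program order)."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("true")      # write-read: later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("anti")      # read-write: later overwrites an input of earlier
    if later.dest == earlier.dest:
        found.add("output")    # write-write: both write the same register
    return found


# Illustrative instructions (assumed, not taken from the text above).
i1 = Instr("R3", ("R3", "R5"))   # R3 := R3 op R5
i2 = Instr("R4", ("R3",))        # R4 := R3 + 1
i3 = Instr("R3", ("R5",))        # R3 := R5 + 1

print(classify(i1, i2))
print(classify(i2, i3))
print(classify(i1, i3))
```

Only the "true" result reflects a flow of data; the "anti" and "output" results are the storage conflicts that arise purely from reusing R3.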

2.2 Dependency Mechanisms

Johnson also discusses in detail various dependency mechanisms, including:
software, register renaming, register renaming with a reorder buffer, register
renaming with a future buffer, interlocks, the copying of operands in the instruction
window to avoid dependencies, and partial renaming.

A conventional hardware implementation relies on software to enforce
dependencies between instructions. A compiler or other code generator can arrange
the order of instructions so that the hardware cannot possibly see an instruction
until it is free of true dependencies and storage conflicts. Unfortunately, this
approach runs into several problems. Software does not always know the latency of
processor operations, and thus, cannot always know how to arrange instructions to
avoid dependencies. There is the question of how the software prevents the hardware
from seeing an instruction until it is free of dependencies. In a scalar processor with
low operation latencies, software can insert "no-ops" in the code to satisfy data
dependencies without too much overhead. If the processor is attempting to fetch
several instructions per cycle, or if some operations take several cycles to complete,
the number of no-ops required to prevent the processor from seeing dependent
instructions rapidly becomes excessive, causing an unacceptable increase in code
size. The no-ops use a precious resource, the instruction cache, to encode
dependencies between instructions.

When a processor permits out-of-order issue, it is not at all clear what mechanism
software should use to enforce dependencies. Software has little control over the
behavior of the processor, so it is hard to see how software prevents the processor
from decoding dependent instructions. The second consideration is that no existing
binary code for any scalar processor enforces the dependencies in a superscalar
processor, because the mode of execution is very different in the superscalar
processor. Relying on software to enforce dependencies requires that the code be
regenerated for the superscalar processor. Finally, the dependencies in the code are
directly determined by the latencies in the hardware, so that the best code for each
version of a superscalar processor depends on the implementation of that version.

On the other hand, there is some motivation against hardware dependency
techniques, because they are inherently complex. Assuming instructions with two
input operands and one output value, as holds for typical RISC instructions, then
there are five possible dependencies between any two instructions: two true
dependencies, two anti-dependencies, and one output dependency. Furthermore, the
number of dependencies between a group of instructions, such as a group of
instructions in a window, varies with the square of the number of instructions in the
group, because each instruction must be considered against every other instruction.
Complexity is further multiplied by the number of instructions that the processor
attempts to decode, issue, and complete in a single cycle. These actions introduce
dependencies. The only aid in reducing complexity is that the dependencies can be
determined incrementally, over many cycles, to help reduce the scope and complexity
of the dependency hardware.
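The quadratic growth described above can be made concrete with a short Python sketch. The function name `max_dependency_checks` is hypothetical; the sketch assumes five possible dependencies per pair of instructions (as stated for two-input, one-output instructions) and that each instruction is considered once against every earlier instruction in the group.

```python
def max_dependency_checks(n, per_pair=5):
    """Upper bound on dependency checks for a group of n instructions."""
    pairs = n * (n - 1) // 2      # each instruction against every earlier one
    return per_pair * pairs


for n in (2, 4, 8):
    print(n, max_dependency_checks(n))
```

Doubling the group from four to eight instructions raises the bound from 30 to 140 checks, which is the growth that motivates simplifying the dependency hardware.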

One technique for removing storage conflicts is by providing additional registers
that are used to reestablish the correspondence between registers and values. The
additional registers are conventionally allocated dynamically by hardware, and the
registers are associated with values needed by the program using "register
renaming." To implement register renaming, processors typically allocate a new
register for every new value produced (i.e., for every instruction that writes a
register). An instruction identifying the original register, for the purpose of reading its
value, obtains instead the value in the newly allocated register. Thus, hardware
renames the original register identifier in the instruction to identify the new register
and correct value. The same register identifier in several different instructions may
access different hardware registers, depending on the locations of register references
with respect to register assignments.

Consider the following code sequence where "op" is an operation, "Rn" represents
a numbered register, and ":=" represents assignment:

R3b := R3a op R5a    (1)
R4b := R3b + 1       (2)
R3c := R5a + 1       (3)
R7b := R3c op R4b    (4)

Each assignment to a register creates a new "instance" of the register, denoted by
an alphabetic subscript. The creation of a new instance for R3 in the third
instruction avoids the anti- and output dependencies on the second and first
instructions, respectively, and yet does not interfere with correctly supplying an
operand to the fourth instruction. The assignment to R3 in the third instruction
supersedes the assignment to R3 in the first instruction, causing R3c to become the
new R3 seen by subsequent instructions until another instruction assigns a value to
R3.

Hardware that performs renaming creates each new register instance and
destroys the instance when its value is superseded and there are no outstanding
references to the value. This removes anti- and output dependencies and allows more
instruction parallelism. Registers are still reused, but reuse is in line with the
requirements of parallel execution. This is particularly helpful with out-of-order issue,
because storage conflicts introduce instruction issue constraints that are not really
necessary to produce correct results. For example, in the preceding instruction
sequence, renaming allows the third instruction to be issued immediately, whereas,
without renaming, the instruction must be delayed until the first instruction is
complete and the second instruction is issued.
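The instance naming used in the four-instruction sequence can be reproduced with a minimal Python sketch. The `rename` function and the `(dest, sources)` tuple encoding are assumptions made for illustration; the sketch only models the bookkeeping of instances, not the allocation and freeing of hardware registers.

```python
from collections import defaultdict
from string import ascii_lowercase


def rename(program):
    """program: list of (dest, [srcs]) in program order.

    Returns the same list with every register replaced by its current
    instance, e.g. "R3" becomes "R3a", "R3b", "R3c" as it is reassigned.
    """
    version = defaultdict(int)        # instance index per register; 0 means "a"
    renamed = []
    for dest, srcs in program:
        # Sources are read before the destination instance is created.
        new_srcs = [s + ascii_lowercase[version[s]] for s in srcs]
        version[dest] += 1            # each assignment creates a new instance
        renamed.append((dest + ascii_lowercase[version[dest]], new_srcs))
    return renamed


program = [
    ("R3", ["R3", "R5"]),   # (1)
    ("R4", ["R3"]),         # (2)
    ("R3", ["R5"]),         # (3)
    ("R7", ["R3", "R4"]),   # (4)
]
for dest, srcs in rename(program):
    print(dest, ":=", srcs)
```

After renaming, instruction (3) writes R3c while instructions (1) and (2) refer to R3a and R3b, so only the true dependencies of (2) on (1) and of (4) on (2) and (3) remain.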

Another technique for reducing dependencies is to associate a single bit (called a
"scoreboard bit") with each register. The scoreboard bit is used to indicate that a
register has a pending update. When an instruction is decoded that will write a
register, the processor sets the associated scoreboard bit. The scoreboard bit is reset
when the write actually occurs. Because there is only one scoreboard bit indicating
whether or not there is a pending update, there can be only one such update for each
register. The scoreboard stalls instruction decoding if a decoded instruction will
update a register that already has a pending update (indicated by the scoreboard bit
being set). This avoids output dependencies by allowing only one pending update to a
register at any given time.
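A minimal Python sketch of the scoreboard-bit behavior follows. The `Scoreboard` class and its method names are hypothetical, and the sketch models only the output-dependency stall described above (one bit per register, set at decode, reset at write).

```python
class Scoreboard:
    """One pending-update bit per register (illustrative model only)."""

    def __init__(self, num_regs):
        self.pending = [False] * num_regs

    def try_decode(self, dest):
        """Return True if an instruction writing `dest` may be decoded."""
        if self.pending[dest]:
            return False              # stall: register already has a pending update
        self.pending[dest] = True     # set the scoreboard bit at decode
        return True

    def write_back(self, dest):
        self.pending[dest] = False    # reset the bit when the write occurs


sb = Scoreboard(32)
print(sb.try_decode(3))   # first writer of register 3 is accepted
print(sb.try_decode(3))   # second writer must stall
sb.write_back(3)
print(sb.try_decode(3))   # accepted once the first write has occurred
```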

Register renaming, in contrast, uses multiple-bit tags to identify the various
uncomputed values, some of which values may be destined for the same processor
register (that is, the same program-visible register). Conventional renaming requires
hardware to allocate tags from a pool of available tags that are not currently
associated with any value and requires hardware to free the tags to the pool once the
values have been computed. Furthermore, since scoreboarding allows only one
pending update to a given register, the processor is not concerned about which update
is the most recent.

A further technique for reducing dependencies is using register renaming with a
"reorder buffer" which uses associative lookup. The associative lookup maps the
register identifier to the reorder buffer entry as soon as the entry is allocated, and, to
avoid output dependencies, the lookup is prioritized so that only the value for the
most recent assignment is obtained if the register is assigned more than once. A tag
is obtained if the result is not yet available. There can be as many instances of a
given register as there are reorder buffer entries, so there are no storage conflicts
between instructions. The values for the different instances are written from the
reorder buffer to the register file in sequential order. When the value for the final
instance is written to the register file, the reorder buffer no longer maps the register;
the register file contains the only instance of the register, and this is the most recent
instance.

However, renaming with a reorder buffer relies on the associative lookup in the
reorder buffer to map register identifiers to values. In the reorder buffer, the
associative lookup is prioritized so that the reorder buffer always provides the most
recent value in the register of interest (or a tag). The reorder buffer also writes
values to the register file in order, so that, if the value is not in the reorder buffer, the
register file must contain the most recent value.
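The prioritized associative lookup and in-order write-back of a reorder buffer can be sketched in Python as follows. The `ReorderBuffer` class, its method names and the use of `None` to mark a value that is not yet computed are all assumptions made for illustration, not a description of any particular hardware.

```python
class ReorderBuffer:
    """Illustrative model of renaming with a reorder buffer."""

    def __init__(self, regfile):
        self.regfile = dict(regfile)
        self.entries = []             # oldest first
        self.next_tag = 0

    def allocate(self, reg):
        """Allocate an entry for a new assignment to `reg`; return its tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"reg": reg, "tag": tag, "value": None})
        return tag

    def complete(self, tag, value):
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value

    def lookup(self, reg):
        """Prioritized lookup: the most recent assignment wins."""
        for entry in reversed(self.entries):
            if entry["reg"] == reg:
                if entry["value"] is not None:
                    return ("value", entry["value"])
                return ("tag", entry["tag"])      # result not yet available
        return ("value", self.regfile[reg])       # not mapped: register file is current

    def retire(self):
        """Write completed values to the register file in sequential order."""
        while self.entries and self.entries[0]["value"] is not None:
            entry = self.entries.pop(0)
            self.regfile[entry["reg"]] = entry["value"]


rob = ReorderBuffer({"R3": 10})
t0 = rob.allocate("R3")
t1 = rob.allocate("R3")
print(rob.lookup("R3"))   # a tag for the most recent, still pending, assignment
```

The search from newest to oldest entry is the part that hardware must implement as a prioritized associative lookup, which is the cost discussed in the surrounding text.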

In a still further technique for reducing dependencies, associative lookup can be
eliminated using a "future file." The future file does not have the properties of the
reorder buffer discussed in the preceding paragraph. A value presented to the future
file to be written may not be the most recent value destined for the corresponding
register, and the value cannot be treated as the most recent value unless it actually
is. The future file therefore keeps track of the most recent update and checks that
each write corresponds to the most recent update before it actually performs the
write.

When an instruction is decoded, it accesses tags in the future file along with the
operand values. If the register has one or more pending updates, the tag identifies
the update value required by the decoded instruction. Once an instruction is decoded,
other instructions may overwrite this instruction's source operands without being
constrained by anti-dependencies, because the operands are copied into the
instruction window. Output dependencies are handled by preventing the writing of a
result into the future file if the result does not have a tag for the most recent value.
Both anti- and output dependencies are handled without stalling instruction issue.
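The future file's check that a write corresponds to the most recent update can be sketched in Python. The `FutureFile` class, its method names and the integer tag scheme are hypothetical; the sketch assumes one "most recent tag" per register and discards any result whose tag does not match it.

```python
class FutureFile:
    """Illustrative model: one value and one most-recent tag per register."""

    def __init__(self, num_regs):
        self.value = [0] * num_regs
        self.latest_tag = [None] * num_regs   # None means no pending update
        self.next_tag = 0

    def decode_dest(self, reg):
        """Record a new pending update to `reg` and return its tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.latest_tag[reg] = tag
        return tag

    def read(self, reg):
        """Return the stored value together with the tag of any pending update."""
        return (self.value[reg], self.latest_tag[reg])

    def write(self, reg, tag, value):
        """Perform the write only if `tag` names the most recent update."""
        if self.latest_tag[reg] != tag:
            return False                      # stale result: not written
        self.value[reg] = value
        self.latest_tag[reg] = None
        return True


ff = FutureFile(8)
old = ff.decode_dest(3)
new = ff.decode_dest(3)
print(ff.write(3, old, 111))   # superseded update is rejected
print(ff.write(3, new, 222))   # most recent update is written
```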

If dependencies are not removed through renaming, "interlocks" must be used to
enforce dependencies. An interlock simply delays the execution of an instruction until
the instruction is free of dependencies. There are two ways to prevent an instruction
from being executed: one way is to prevent the instruction from being decoded, and
the other is to prevent the instruction from being issued.

To improve performance over scoreboarding, interlocks are moved from the
decoder to the instruction window using a "dispatch stack." The dispatch stack is an
instruction window that augments each instruction in the window with dependency
counts. There is a dependency count associated with the source register of each
instruction in the window, giving the number of pending prior updates to the source
register and thus the number of updates that must be completed before all possible
true dependencies are removed. There are two similar dependency counts associated
with the destination register of each instruction in the window, giving both the
number of pending prior uses of the register (which is the number of
anti-dependencies) and the number of pending prior updates to the register (which is the
number of output dependencies).

When an instruction is decoded and loaded into the dispatch stack, the
dependency counts are set by comparing the instruction's register identifiers with the
register identifiers of all instructions already in the dispatch stack. As instructions
complete, the dependency counts of instructions that are still in the window are
decremented based on the source and destination register identifiers of completing
instructions (the counts are decremented by a variable amount, depending on the
number of instructions completed). An instruction is independent when all of its
counts are zero. The use of counts avoids having to compare all instructions in the
dispatch stack to all other instructions on every cycle.
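The dependency counts of a dispatch stack can be sketched in Python as follows. The `Entry` and `DispatchStack` classes are hypothetical, and the sketch makes two simplifying assumptions that are not in the text: one true-dependency count is kept per source register, and a pending "use" of a register ends when the using instruction completes.

```python
class Entry:
    """An instruction in the window, augmented with dependency counts."""

    def __init__(self, dest, srcs, window):
        self.dest = dest
        self.srcs = tuple(srcs)
        # Counts are set by comparing against instructions already in the window.
        self.true = [sum(1 for e in window if e.dest == s) for s in self.srcs]
        self.anti = sum(1 for e in window if dest in e.srcs)
        self.output = sum(1 for e in window if e.dest == dest)

    def independent(self):
        return not any(self.true) and self.anti == 0 and self.output == 0


class DispatchStack:
    def __init__(self):
        self.window = []              # program order, oldest first

    def load(self, dest, srcs):
        entry = Entry(dest, srcs, self.window)
        self.window.append(entry)
        return entry

    def complete(self, done):
        """Remove `done` and decrement the counts of later instructions."""
        pos = self.window.index(done)
        later = self.window[pos + 1:]     # only these ever counted `done`
        del self.window[pos]
        for e in later:
            for k, s in enumerate(e.srcs):
                if s == done.dest:
                    e.true[k] -= 1
            if e.dest in done.srcs:
                e.anti -= 1
            if e.dest == done.dest:
                e.output -= 1


ds = DispatchStack()
i1 = ds.load("R3", ["R3", "R5"])
i2 = ds.load("R4", ["R3"])
i3 = ds.load("R3", ["R5"])
print(i1.independent(), i2.independent(), i3.independent())
```

Because the counts are only adjusted when an instruction completes, no instruction-to-instruction comparison is needed on cycles in which nothing completes.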

Anti-dependencies can be avoided altogether by copying operands to the
instruction window (for example, to the reservation stations) during instruction
decode. In this manner, the operands cannot be overwritten by subsequent register
updates. Operands can be copied to eliminate anti-dependencies in any approach,
independent of register renaming. The alternative to copying operands is to interlock
anti-dependencies, but the comparators and/or counters required for these interlocks
are costly, considering the number of combinations of source and result registers to
be compared.

A tag can be supplied for the operand rather than the operand itself. This tag is
simply a means for the hardware to identify which value the instruction requires, so
that, when the operand value is produced, it can be matched to the instruction. If
there can be only one pending update to a register, the register identifier can serve as
a tag (as with scoreboarding). If there can be more than one pending update to a
register (as with renaming), there must be a mechanism for allocating result tags
and ensuring uniqueness.

An alternative to scoreboarding interlocking is to allow multiple pending updates
of registers to avoid stalling the decoder for output dependencies, but to handle
anti-dependencies by copying operands (or tags) during decode. An instruction in the
window is not issued until it is free of output dependencies, so the updates to each
register are performed in the same order in which they would be performed with
in-order completion, except that updates for different registers are out of order with
respect to each other. This alternative has almost all of the capabilities of register
renaming, lacking only the capability to issue instructions so that updates to the
same register occur out of order.

There appears to be no better alternative to renaming other than with a reorder
buffer. Underlying the discussion of dependencies has been the assumption that the
processor performs out-of-order issue and already has a reorder buffer for recovering
from mispredicted branches. Out-of-order issue makes it unacceptable to stall the
decoder for dependencies. If the processor has an instruction window, it is
inconsistent to limit the look ahead capability of the processor by interlocking the
decoder. There are then only two alternatives: implement anti- and output
dependency interlocks in the window or remove these altogether with renaming.

SUMMARY OF THE INVENTION

The present invention is directed to instruction scheduling including register
renaming and instruction issuing for superscalar RISC computers. A Register
Rename Circuit (RRC), which is part of the scheduling logic, allows a computer's
Instruction Execution Unit (IEU) to execute several instructions at the same time
while avoiding dependencies. In contrast to conventional register renaming, the
present invention does not actually rename register addresses. The RRC of the
present invention temporarily buffers the instruction results, and the results of
out-of-order instruction execution are not transferred to the register file until all previous
instructions are done. The RRC also performs result forwarding to provide
temporarily buffered operands (results) to dependent instructions. The RRC
contains three subsections: a Data Dependency Checker (DDC), Tag Assign Logic
(TAL) and Register file Port MUXes (RPM).

The function of the DDC is to locate the dependencies between the instructions for
a group of instructions. The DDC does this by comparing the addresses of the source
registers of each instruction to the addresses of the destination registers of each
previous instruction in the group. For example, if instruction A reads a value from a
register that is written to by instruction B, then instruction A is dependent upon
instruction B and instruction A cannot start until instruction B has finished. The
DDC outputs indicate these dependencies.
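The comparison performed by the DDC can be sketched in software as follows. This is only an illustrative Python model, not the circuit itself: the function name `dependency_check` and the `(dest, sources)` encoding of an instruction are assumptions, and `None` is used for an instruction with no destination.

```python
def dependency_check(group):
    """group: list of (dest, [srcs]) in program order; dest may be None.

    Returns deps, where deps[i][k] lists the indices of the previous
    instructions in the group whose destination matches source k of
    instruction i.
    """
    deps = []
    for i, (_, srcs) in enumerate(group):
        per_source = []
        for s in srcs:
            writers = [j for j in range(i)
                       if group[j][0] is not None and group[j][0] == s]
            per_source.append(writers)
        deps.append(per_source)
    return deps


group = [
    ("R1", ["R2"]),          # instruction 0
    ("R3", ["R1", "R4"]),    # instruction 1 reads R1 written by 0
    ("R1", ["R3"]),          # instruction 2 reads R3 written by 1
    (None, ["R1"]),          # instruction 3 reads R1 written by 0 and 2
]
print(dependency_check(group))
```

Each source is compared only against earlier destinations, which matches the statement that an instruction can depend only on previous instructions in the group.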

The outputs of the DDC go to the TAL. Because it is possible for an instruction to
be dependent on more than one previous instruction, the TAL must determine which
of those previous instructions will be the last one to be executed. The present
invention automatically maps each instruction to a predetermined temporary buffer
location; hence, the present invention does not need prioritized associative look-up as
used by conventional reorder buffers, thereby saving chip area/cost and execution
speed.

Out-of-order results for several instructions being executed at the same time are
stored in a set of temporary buffers, rather than in the register file location designated
by the instruction. If the DDC determines, for example, that a source register of
instruction 6 is written to by instructions 2, 3 and 5, then the TAL will indicate that
instruction 6 must wait for instruction 5 by outputting the "tag" of instruction 5 for
instruction 6. The tag of instruction 5 shows the temporary buffer location where
instruction 5's result is stored. It also contains a one-bit signal (called a "done flag")
that indicates if instruction 5 is finished or not. The TAL will output three tags for
each instruction, because each instruction can have three source registers. If an
instruction is not dependent on any previous instruction, the TAL will output the
register file address of the instruction's input, rather than a temporary buffer's address.
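The tag selection described above can be sketched in Python. The function `assign_tag` and the tuple layout `(kind, address, done)` are hypothetical; the sketch assumes that the temporary buffer location of an instruction is simply its index in the group, and that an operand read from the register file is always treated as done.

```python
def assign_tag(src_reg, writers, done):
    """Choose the tag for one source operand.

    src_reg: register file address named by the instruction.
    writers: indices of previous instructions that write src_reg
             (for example, as reported by a dependency check).
    done:    list of done flags, indexed by instruction.
    """
    if not writers:
        # No dependency: use the register file address itself.
        return ("regfile", src_reg, True)
    producer = max(writers)           # the most recent previous writer
    return ("temp", producer, done[producer])


done = [True, True, True, True, True, False, False, False]
print(assign_tag("R7", [2, 3, 5], done))
print(assign_tag("R7", [], done))
```

Because the buffer location follows directly from the producer's position, choosing the tag needs only a priority selection among the writers rather than an associative search.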

The last part of the RRC is the RPMs or Register file Port MUXes. The inputs
of the RPMs are the outputs of the TAL, and the select lines for the RPMs come from
another part of the IEU called the Instruction Scheduler or Issuer. The Instruction
Scheduler chooses which instruction to execute (this decision is based partly on the
done flags) and then uses the RPMs to select the tags of that instruction. These tags
go to the read address ports of the computer's register files. In the previous example,
once instruction 5 has finished, the Instruction Scheduler will start instruction 6. It
will select the RPM so that the address of instruction 5's result (its tag) is sent to the
register file, and the register file will make the result of instruction 5 available to
instruction 6.
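The interaction between the Instruction Scheduler and the RPMs can be sketched in Python. The function names `pick_ready` and `rpm_select`, and the tag layout `(kind, address, done)`, are assumptions for illustration; the selection policy shown (the first unissued instruction whose tags are all done) is a simplification, since the text says the decision is only partly based on the done flags.

```python
def pick_ready(tags, issued):
    """Return the index of the first unissued instruction whose operands are all done."""
    for i, instr_tags in enumerate(tags):
        if i not in issued and all(done for _, _, done in instr_tags):
            return i
    return None


def rpm_select(tags, select):
    """Multiplexer: pass the selected instruction's tags to the read address ports."""
    return [(kind, addr) for kind, addr, _ in tags[select]]


# Instruction 0 reads R1 from the register file; instruction 1 waits
# for the result of instruction 0 in temporary buffer 0.
tags = [
    [("regfile", "R1", True)],
    [("temp", 0, False)],
]
first = pick_ready(tags, issued=set())
print(first, rpm_select(tags, first))
```

Once the done flag of temporary buffer 0 is set, the same two calls would select instruction 1 and send the buffer address, rather than a register file address, to the read port.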

The foregoing and other features and advantages of the present invention will be
apparent from the following more particular description of the preferred
embodiments of the invention, as illustrated in the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be better understood if reference is made to the accompanying
drawings.

FIG. 1 shows a representative high level block diagram of the register renaming
circuit of the present invention.

FIG. 2 shows a representative block diagram of the data dependency check circuit
of the present invention.

FIG. 3 shows a representative block diagram of the tag assignment logic of the
present invention.

FIG. 4 shows a representative block diagram of the register file port multiplexers
of the present invention.

FIG. 5 is a representative flowchart showing a data dependency check method for
IXS1 and IYS/D in accordance with the present invention.

FIGs. 6A and 6B are representative flowcharts showing a tag assignment
method in accordance with the present invention.

FIG. 7 shows a representative block diagram which compares an instruction Y's
source/destination operand with each operand of an instruction X in accordance with
an embodiment of the present invention.

FIG. 8 shows a representative circuit diagram for comparator block 706 of FIG. 7.

FIG. 9 shows a representative block diagram of a Priority Encoder in accordance
with an embodiment of the present invention.

FIG. 10 shows a representative block diagram of the instruction scheduling logic
of the present invention.


DETAILED DESCRIPTION

FIG. 1 shows a representative high level block diagram of an Instruction
Execution Unit (IEU) 100 associated with the present invention. The goal of IEU
100 is to execute as many instructions as possible in the shortest amount of time.
There are two basic ways to accomplish this: optimize IEU 100 so that each
instruction takes as little time as possible or optimize IEU 100 so that it can execute
several instructions at the same time.

Instructions are sent to IEU 100 from an Instruction Fetch Unit (IFU, not
shown) through an instruction FIFO (first-in-first-out register stack storage device)
101 in groups of four called "buckets." IEU 100 can decode and schedule up to two
buckets of instructions at one time. FIFO 101 stores 16 total instructions in four
buckets labeled 0-3. IEU 100 looks at an instruction window 102. In one
embodiment of the present invention, window 102 comprises eight instructions
(buckets 0 and 1). Every cycle IEU 100 tries to issue a maximum number of
instructions from window 102. Window 102 functions as an instruction buffer register.
Once the instructions in a bucket are executed and their results stored in the
processor's register file (see block 117), the bucket is flushed out a bottom 104 and a
new bucket is dropped in at a top 106.

In order to execute instructions in parallel or out of order, care must be taken so
that the data that each instruction needs is available when the instruction needs it
and also so that the result of each instruction is available for any future instructions
that might need it. A Register Rename Circuit (RRC), which is part of the scheduling
logic of the computer's IEU, performs this function by locating dependencies between
current instructions and then renaming the sources (inputs) of the instruction.

As noted above, there are three types of dependencies: input dependencies, output
dependencies and anti-dependencies. Input dependencies occur when an instruction,
call it A, performs an operation on the result of a previous instruction, call it B.
Output dependencies occur when the outputs of A and B are to be stored in the same
place. Anti-dependencies occur when instruction A comes before B in the instruction
stream and B's result will be stored in the same place as one of A's inputs.

Input dependencies are handled by not executing instructions until their inputs
are available. RRC 112 is used to locate the input dependencies between current
instructions and then to signal an Instruction Scheduler or Issuer 118 when all
inputs for a particular instruction are ready. In order to locate these dependencies,
RRC 112 compares the register file addresses of each instruction's inputs with the
addresses of each previous instruction's output using a data dependency circuit
(DDC) 108. If one instruction's input comes from a register where a previous
instruction's output will be stored, then the later instruction must wait for the
earlier instruction to finish.

This implementation of RRC 112 can check eight instructions at the same time,
so a current instruction is defined as any one of those eight from window 102. It
should become evident to those skilled in the art that the present invention can easily
be adapted to check more or fewer instructions.

In one embodiment of the present invention, instructions can have from 0 to 3
inputs and 0 or 1 outputs. Most instructions' inputs and outputs come from, or are
stored in, one of several register files. Each register file 117 (e.g., separate integer,
floating and boolean register files) has 32 real entries plus the group of 8 temporary
buffers 116. When an instruction completes (the term "complete" means that the
operation is complete and the operand is ready to be written to its destination
register), its result is stored in its preassigned location in the temporary buffers 116.
Its result is later moved to the appropriate place in register file 117 after all previous
instructions' results have been moved to their places in the register file. This
movement of results from temporary buffers 116 to register file 117 is called
"retirement" and is controlled by termination logic, as should become evident to those
skilled in the art. More than one instruction may be retired at a time. Retirement
comprises updating the "official state" of the machine including the computer's
Program Counter, as will become evident to those skilled in the art. For example, if
instruction I0 happens to complete directly before instruction I1, both results can be
stored directly into register file 117. But if instruction I3 then completes, its result
must be stored in temporary buffer 116 until instruction I2 completes. By having
IEU 100 store each instruction's result in its preassigned place in the temporary
buffers 116, IEU 100 can execute instructions out of program order and still avoid
the problems caused by output and anti-dependencies.
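In-order retirement from the temporary buffers can be sketched in Python. The `retire` function and the list-of-slots representation are hypothetical; the sketch assumes one preassigned slot per instruction in program order, with `None` marking a result that has not completed, and it ignores the Program Counter update and instructions without an output.

```python
def retire(temp_buffers, regfile):
    """Move the longest completed prefix of results into the register file.

    temp_buffers: list in program order; each slot is None (not complete)
                  or a (dest_reg, value) pair.
    Returns the number of instructions retired.
    """
    retired = 0
    while retired < len(temp_buffers) and temp_buffers[retired] is not None:
        dest, value = temp_buffers[retired]
        regfile[dest] = value          # results reach the register file in order
        retired += 1
    del temp_buffers[:retired]
    return retired


regfile = {}
# I0, I1 and I3 have completed; I2 has not.
buffers = [("R1", 10), ("R2", 20), None, ("R1", 40)]
print(retire(buffers, regfile), regfile)

buffers[0] = ("R3", 30)                # I2 completes
print(retire(buffers, regfile), regfile)
```

The result of I3 stays buffered until I2 completes, so the later write to R1 can never be overtaken by an earlier one.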

RRC 112 sends a bit map to an Instruction Scheduler 118 via a bus 120
indicating which instructions in window 102 are ready for issuing. Instruction decode
logic (not shown) indicates to Issuer 118 the resource requirements for each
instruction over a bus 123. For each resource in IEU 100 (e.g., each functional unit
being an adder, multiplier, shifter, or the like), Issuer 118 scans this information and
selects the first and subsequent instructions for issuing by sending issue signals over
bus 121. The issue signals select a group of Register File Port MUXes (RPMs) 124
inside RRC 112 whose inputs are the addresses of each instruction's inputs.

Because the results may stay in temporary buffer 116 several cycles before going
to register file 117, a mechanism is provided to get results from temporary buffer 116
before they go to register file 117, so the information can be used as operands for
other instructions. This mechanism is called "result forwarding," and without it,
Issuer 118 would not be able to issue instructions out of order. This result forwarding
is done in register file 117 and is controlled by RRC 112. The control signals
necessary for performing the result forwarding will become evident to those skilled in
the art, as should the random logic used for generating such control signals.

If an instruction is not dependent on any of the current instructions, result
forwarding is not necessary since the instruction's inputs are already in register file
117. When Issuer 118 decides to execute that instruction, RRC 112 tells register file
117 to output its data.

RRC 112 contains three subsections: a Data Dependency Checker (DDC) 108,
Tag Assign Logic (TAL) 122 and Register File Port MUXes (RPM) 124. DDC 108
determines where the input dependencies are between the current instructions. TAL
122 monitors the dependencies for Issuer 118 and controls result forwarding. RPM
124 is controlled by Issuer 118 and directs the outputs of TAL 122 to the appropriate
register file address ports 119. Instructions are passed to DDC 108 via bus 110. All
source registers are compared with all previous destination registers for each
instruction in window 102.

Each instruction has only one destination, which may be a double register in one
embodiment. An instruction can only depend on a previous instruction and may have
up to three source registers. There are various register file source and destination
addresses that need to be checked against each other for any dependencies. As noted
above, the eight bottom instructions corresponding to the lower two buckets are
checked by DDC 108. All source register addresses are compared with all previous
destination register addresses for the instructions in window 102.

For example, let's say a program has the following instruction sequence:

    add R0,R1,R2    (0)
    add R0,R2,R3    (1)
    add R4,R5,R2    (2)
    add R2,R3,R4    (3)

The first two registers in each instruction 0-3 are the source registers, and the
last listed register in each instruction is the destination register. For example, R0
and R1 are the source registers for instruction 0 and R2 is the destination register.
Instruction 0 adds the contents of registers 0 and 1 and stores the result in R2. For
instructions 1-3 in this example, the following are the comparisons needed to
evaluate all of the dependencies:

    I1S1,I1S2 vs. I0D
    I2S1,I2S2 vs. I1D,I0D
    I3S1,I3S2 vs. I2D,I1D,I0D

The key to the above is as follows: IXS1 is the address of source (input) number
1 of instruction X; IXS2 is the address of source (input) number 2 of instruction X;
and IXD is the address of the destination (output) of instruction X.

Note also that RRC 112 can ignore the fact that instruction 2 is output
dependent on instruction 0, because the processor has a temporary buffer where
instruction 2's result can be stored without interfering with instruction 0's result. As
discussed before, instruction 2's result will not be moved from temporary buffers 116
to register file 117 until instructions 0 and 1's results are moved to register file 117.

The number of instructions that can be checked by RRC 112 is easily scalable.
In order to check eight instructions at a time instead of four, the following additional
comparisons would also need to be made:

    I4S1,I4S2 vs. I3D,I2D,I1D,I0D
    I5S1,I5S2 vs. I4D,I3D,I2D,I1D,I0D
    I6S1,I6S2 vs. I5D,I4D,I3D,I2D,I1D,I0D
    I7S1,I7S2 vs. I6D,I5D,I4D,I3D,I2D,I1D,I0D
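
The comparison lists above follow a regular pattern, which the short sketch below generates for any window size. It is only an illustration of the scaling argument; the function names `comparisons` and `pair_count` are hypothetical and do not come from the described hardware.

```python
def comparisons(n):
    """Map each instruction's source pair to the previous destinations
    it must be compared against, newest first (window of n instructions)."""
    table = {}
    for x in range(1, n):
        sources = (f"I{x}S1", f"I{x}S2")
        table[sources] = [f"I{y}D" for y in range(x - 1, -1, -1)]
    return table


def pair_count(n):
    """Number of (later instruction, earlier instruction) pairs to check."""
    return n * (n - 1) // 2


four = comparisons(4)
eight = comparisons(8)
```

The number of instruction pairs grows as n(n-1)/2, so a window of four needs 6 pairwise checks and a window of eight needs 28.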

There are several special cases that RRC 112 must handle in order to do the
dependency check. First, there are some instructions that use the same register as
an input and an output. Thus, RRC 112 must compare this source/destination
register address with the destination register addresses of all previous instructions.
So for instruction 7, the following comparisons would be necessary:

    I7S1,I7S2,I7S/D vs. I6D,I5D,I4D,I3D,I2D,I1D,I0D

Another special case occurs when a program contains instructions that generate
64 bit outputs (called long-word operations). These instructions need two registers in
which to store their results. In this embodiment, these registers must be sequential.
Thus if RRC 112 is checking instruction 4's dependencies and instruction 1 is a
long-word operation, then it must do the following comparisons:

    I4S1,I4S2 vs. I3D,I2D,I1D,I1D+1,I0D

Sometimes, instructions do not have destination registers. Thus RRC 112 must
ignore any dependencies between instructions without destination registers and any
future instructions. Also, instructions may have only one valid source register, so
RRC 112 must ignore any dependencies between the unused source register (usually
S2) and any previous instructions.

RRC 112 is also capable of dealing with multiple register files. When using
multiple register files, dependencies only occur when one instruction's source register
has the same address and is in the same register file as some other instruction's
destination register. RRC 112 treats the information regarding which register file a
particular address is from as part of the address. For example, in an implementation
using four 32 bit register files, RRC 112 would do 7 bit compares instead of 5 bit
compares (5 for the address and 2 for the register file).
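
Treating the register file selector as part of the address can be sketched as below. The placement of the 2-bit file selector in the upper bits is an assumption made for illustration; the text states only that the compare widens from 5 to 7 bits. The names `extended_address` and `same_register` are hypothetical.

```python
ADDR_BITS = 5   # register address within one file
FILE_BITS = 2   # selects one of four register files


def extended_address(file_id, reg):
    """Concatenate file selector and register address into one 7-bit value.

    Assumption: the file selector occupies the two upper bits.
    """
    if not (0 <= file_id < (1 << FILE_BITS) and 0 <= reg < (1 << ADDR_BITS)):
        raise ValueError("file_id or reg out of range")
    return (file_id << ADDR_BITS) | reg


def same_register(file_a, reg_a, file_b, reg_b):
    """A dependency is possible only if both file and address match."""
    return extended_address(file_a, reg_a) == extended_address(file_b, reg_b)
```

With this encoding a single 7-bit equality check replaces separate file and address checks.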

Signals indicating which instructions are long-word operations or have invalid
source or destination registers are sent to RRC 112 from Instruction Decode Logic
(IDL; not shown). IDL also tells RRC 112 which register file each instruction's
sources and destinations will come from or go to.

A block diagram of DDC 108 is shown in FIG. 2. Source address signals arrive
from IFIFO 101 for all eight instructions of window 102. Additional inputs include
long-word load operation flags, register file decode signals, invalid destination register
flags, destination address signals and addressing mode flags for all eight instructions.

DDC 108 comprises 28 data dependency blocks 204. Each block 204 is described
in a KEY 206. Each block 204 receives 3 inputs, IXS1, IXS2 and IXS/D. IXS1 is the
address of source (input) number 1 of instruction X; IXS2 is the address of source
(input) number 2 of instruction X; and IXS/D is the address of the source/destination
(input) of instruction X. Each block 204 also receives input IYS/D, which is the
destination register address for some previous instruction Y. A top row 208, for
example, receives I0S/D, which is the destination register address for instruction 0.
Each block 204 outputs the data dependency results to one of a corresponding bus
line 114. For example, the address of I2S/D must be checked with operand addresses
S1, S2 and S/D of instructions 7, 6, 5, 4, and 3.

Each block 204 performs the three comparisons. To illustrate these
comparisons, consider a generic block 700 shown in FIG. 7, which compares
instruction Y's source/destination operand with each operand of instruction X. In this
example, the three following comparisons must be made:

    IXS1  = IYS/D
    IXS2  = IYS/D
    IXS/D = IYS/D

These comparisons are represented by three comparator blocks 702, 704 and
706, respectively. One set of inputs to comparator blocks 702, 704 and 706 are the
bits of the IYS/D field, which is represented by number 708. Comparator block 702
has as its second set of inputs the bits of the IXS1. Similarly, comparator block 704
has as its second set of inputs the bits of the IXS2, and comparator block 706 has as
its second set of inputs the bits of the IXS/D.

In a preferred embodiment, the comparisons performed by blocks 702, 704 and
706 can be performed by random logic. An example of random logic for comparator
block 706 is shown in FIG. 8. Instruction Y's source/destination bits [6:0] are shown
input from the right at reference number 802 and instruction X's source/destination
bits [6:0] are shown input from the top at reference number 804. The most
significant bit (MSB) is bit 6 and the least significant bit (LSB) is bit 0. The
corresponding bits from the two operands are fed to a set of seven exclusive NOR
gates (XNORs) 806. The outputs of XNORs 806 are then ANDed by a seven input
AND gate 808. If the corresponding bits are the same, the output of XNOR 806 will
be logic high. When all bits are the same, all seven XNOR 806 outputs are logic high
and the output of AND gate 808 is logic high; this indicates that there is a
dependency between IXS/D and IYS/D.

The random logic for comparator blocks 702 and 704 will be identical to that
shown in FIG. 8. The present invention contemplates many other random logic
circuits for performing data dependency checking, as will become evident to those
skilled in the art without departing from the spirit of this example.

As will further become evident to those skilled in the art, various implementation
specific special cases can arise which require additional random logic to perform data
dependency checking. An illustrative special data dependency checking case is for
long word handling.

As mentioned before, if a long word operation writes to register X, the first 32 bits
are written to register X and the second 32 bits are written to register X+1. The data
dependency checker therefore needs to check both registers when doing a
comparison. In a preferred embodiment, register X is an even register, X+1 is an odd
register and thus they only differ by the LSB. The easiest way to check both
registers at the same time is to simply ignore the LSB. In the case of a store long
(STLG) or load long (LDLG) operation, if X and Y only differ by the LSB (bit [0]), the
logic in FIG. 8 would indicate no dependency when there really is a
dependency. Therefore, for a long word operation the STLG and LDLG flags must be
ORed with the output of the bit [0] XNOR to assure that all dependencies are
detected.
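
The XNOR/AND comparator of FIG. 8, together with the long-word adjustment just described, can be modeled bit by bit as in the sketch below. This is a software illustration under stated assumptions, not the circuit itself: the single `long_word` argument stands in for the ORed STLG and LDLG flags, and the function names are hypothetical.

```python
WIDTH = 7  # bits [6:0] of the source/destination address


def xnor_bits(x_addr, y_addr, width=WIDTH):
    """Per-bit XNOR outputs, index 0 being the LSB (1 means the bits match)."""
    return [1 - (((x_addr >> i) & 1) ^ ((y_addr >> i) & 1)) for i in range(width)]


def depends(x_addr, y_addr, long_word=False):
    """AND of all XNOR outputs; for a long-word operation the flag is ORed
    with the bit-0 XNOR output so that the LSB is effectively ignored."""
    eq = xnor_bits(x_addr, y_addr)
    eq[0] |= int(long_word)
    return all(eq)
```

For example, addresses 4 and 5 do not match in the plain comparator, but they do match when the long-word flag is set, because register 5 is the second half of the pair starting at register 4.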

A data dependency check flowchart for IXS1 and IYS/D is shown in FIG. 5. DDC
108 first checks whether IXS1 and IYS/D are in the same register file, as shown at a
conditional block 502. If they are not in the same register file there is no dependency.
This is shown at a block 504. If they are in the same register file, DDC 108 then
determines whether IXS1 and IYS/D are in the same register, as shown at a block 506.
If they are not in the same register, flow proceeds to a conditional block 508 where
DDC 108 determines whether IY is a long word operation. If IY is not a long word
operation there is no dependency and flow proceeds to block 504. If IY is a long word
operation, flow then proceeds to a conditional block 510 where DDC 108
determines whether IXS1 and IYS/D+1 are the same register. If they are not, there
is no dependency and flow proceeds to block 504. If IXS1 and IYS/D+1 are the
same register, flow proceeds to a conditional block 512 where DDC 108 determines if
IY has a valid destination. If it does not have a valid destination, there is no
dependency and flow proceeds to block 504. If IY does have a valid destination, flow
proceeds to a conditional block 514 where DDC 108 determines if IXS1 has a valid
source register. Again, if no valid source register is detected there is no dependency,
and flow proceeds to block 504. If a valid source register is detected, DDC 108 has
determined that there is a dependency between IXS1 and IYS/D, as shown at a block
516.
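
The FIG. 5 flow can be followed step by step in the sketch below. It is a reading of the text rather than of the figure: the text does not say where flow goes when IXS1 and IYS/D are the same register, so the sketch assumes it continues to the validity checks at block 512. The `Operand` type and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    file_id: int        # which register file the operand lives in
    reg: int            # register address within that file
    valid: bool = True  # False if the instruction has no such operand


def source_depends(src, dst, dst_is_long=False):
    """Return True if source operand src (IXS1) depends on dst (IYS/D)."""
    if src.file_id != dst.file_id:      # block 502
        return False                    # block 504: no dependency
    if src.reg != dst.reg:              # block 506
        if not dst_is_long:             # block 508
            return False
        if src.reg != dst.reg + 1:      # block 510
            return False
    # Assumed path: a register match also reaches the validity checks.
    if not dst.valid:                   # block 512
        return False
    if not src.valid:                   # block 514
        return False
    return True                         # block 516: dependency
```

The same routine would apply to IXS2 and IXS/D, since the three comparators in each block 204 are described as identical.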

A more detailed discussion of data dependency checking is found in commonly
owned, copending application Serial No. 07/860,718 (Attorney Docket No. SP041),
the disclosure of which is incorporated herein by reference.

Because it is possible that an instruction might get one of its inputs from a
register that was written to by several other instructions, the present invention
must choose which one is the real dependency. For example, if instructions 2 and 5
write to register 4 and instruction 7 reads register 4, then instruction 7 has two
possible dependencies. In this case, it is assumed that since instruction 5 came after
instruction 2 in the program, the programmer intended instruction 7 to use
instruction 5's result and not instruction 2's. So, if an instruction can be dependent
on several previous instructions, RRC 112 will consider it to be dependent on the
highest numbered previous instruction.

Once TAL 122 has determined where the real dependencies are, it must locate the
inputs for each instruction. In a preferred embodiment of the present invention, the
inputs can come from the actual register file or an array of temporary buffers 116.






RRC 112 assumes that if an instruction has no dependencies, its inputs are all in the
register file. In this case, RRC 112 passes the IXS1, IXS2 and IXS/D addresses that
came from IFIFO 101 to the register file. If an instruction has a dependency, then
RRC 112 assumes that the data is in temporary buffers 116. Since RRC 112 knows
which previous instruction each instruction depends on, and since each instruction
always writes to the same place in temporary buffers 116, RRC 112 can determine
where in temporary buffers 116 an instruction's inputs are stored. It sends these
addresses to register file read ports 119 and register file 117 outputs the data from
temporary buffers 116 so that the instruction can use it.

The following is an example of tag assignments:

    0: add r0,r1,r2
    1: add r0,r2,r3
    2: add r4,r5,r2
    3: add r2,r3,r4

The following are the dependencies for the above operations (dependencies are
represented by the symbol "#"):

    I1S2 # I0S/D
    I3S1 # I0S/D
    I3S1 # I2S/D
    I3S2 # I1S/D

First, look at I0; since it has no dependencies, its tags are equal to its original
source register addresses:

    I0S1 TAG  = I0S1  = r0
    I0S2 TAG  = I0S2  = r1
    I0S/D TAG = I0S/D = r2

I1 has one dependency, and its tags are as follows:

    I1S1 TAG  = I1S1  = r0
    I1S2 TAG  = I0S/D = t0
        where: (t0 = inst. 0's slot in temporary buffer)
    I1S/D TAG = I1S/D = r3

I2 is also independent:

    I2S1 TAG  = I2S1  = r4
    I2S2 TAG  = I2S2  = r5
    I2S/D TAG = I2S/D = r2

I3S1 has two possible dependencies, I0S/D and I2S/D. Because TAL 122 must
pick the last one (highest numbered one), I2S/D is chosen.

    I3S1 TAG  = I2S/D = t2
    I3S2 TAG  = I1S/D = t1
    I3S/D TAG = I3S/D = r4

These tags are then sent to RPM 124 via bus 126 to be selected by Issuer 118.
At the same time TAL 122 is preparing the tags, it is also monitoring the outputs of
DCL 130 and passing them on to Issuer 118 using bus 120. TAL 122 chooses the
proper outputs of DCL 130 to pass to Issuer 118 by the same method that it
chooses the tags that it sends to RPM 124.

Continuing the example, TAL 122 sends the following ready signals to Issuer 118:

    I0S1 INFO  = 1
        (Inst 0 is independent so it can start immediately)
    I0S2 INFO  = 1
    I0S/D INFO = 1

    I1S1 INFO  = 1
    I1S2 INFO  = DONE[0]
        (DONE[0] = 1 when I0 is done)
    I1S/D INFO = 1

    I2S1 INFO  = 1
    I2S2 INFO  = 1
    I2S/D INFO = 1

    I3S1 INFO  = DONE[2]
    I3S2 INFO  = DONE[1]
    I3S/D INFO = 1

(The DONE signals come from DCL 130 via a bus 132. In connection with the
present invention, the term "done" means the result of the instruction is in a
temporary buffer or otherwise available at the output of a functional unit.
Contrastingly, the term "terminate" means the result of the instruction is in the
register file.)
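
The tag and ready-signal tables for the four-instruction example can be reproduced with the short sketch below. It is a software illustration only: the names `assign_tags`, the `tN` slot labels as strings, and the tuple layout are assumptions made for the example, and the S/D operand is simply passed through as it is in the tables.

```python
def assign_tags(program):
    """program: list of (s1, s2, dest) register names in program order.

    Returns (tags, info): for every instruction, the tag and the ready
    signal of S1, S2 and S/D. A source that matches an earlier destination
    is redirected to the temporary-buffer slot of the highest numbered
    matching instruction and becomes ready when that instruction is done.
    """
    tags, info = [], []
    for x, (s1, s2, dest) in enumerate(program):
        tag_row, info_row = [], []
        for src in (s1, s2):
            # Scan previous instructions from newest to oldest.
            producer = next(
                (y for y in range(x - 1, -1, -1) if program[y][2] == src), None
            )
            if producer is None:
                tag_row.append(src)
                info_row.append("1")
            else:
                tag_row.append(f"t{producer}")
                info_row.append(f"DONE[{producer}]")
        tag_row.append(dest)   # S/D tag stays a register-file address here
        info_row.append("1")
        tags.append(tuple(tag_row))
        info.append(tuple(info_row))
    return tags, info


program = [("r0", "r1", "r2"), ("r0", "r2", "r3"),
           ("r4", "r5", "r2"), ("r2", "r3", "r4")]
tags, info = assign_tags(program)
```

Instruction 3 reads r2, which both instruction 0 and instruction 2 write; the newest-first scan picks instruction 2, matching the choice of I2S/D in the example.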

Turning now to FIG. 3, a representative block diagram of TAL 122 will be
discussed. TAL 122 comprises 8 tag assignment logic blocks 302. Each TAL block
302 receives the corresponding data dependency results via buses 114, as well as
further signals that come from the computer's Instruction Decode and control logic
(not shown). The BKT bit signal forms the least significant bit of the tag. DONE[X]
flags are for instructions 0 through 6, and indicate if instruction X is done.
DBLREG[X] flags indicate which, if any, of the instructions is a double (long) word.
Each TAL block 302 also receives its own instruction's register addresses as inputs.
The Misc. signals, DBLREG and BKT signals are all implementation dependent
control signals. Each TAL block 302 outputs 3 TAGs 126 labeled IXS1, IXS2 and
IXS/D, which are 6 bits. TAL 122 outputs the least significant 5 bits of each TAG
signal to RPMs 124 and the most significant TAG bit to Issuer 118.

Each block 302 of FIG. 3 comprises three Priority Encoders (PE), one for S1, one
for S2 and one for S/D. There is one exception, however. I0 requires no tag
assignment. Its tags are the same as the original S1, S2 and S/D addresses, because
I0 is always independent.

An illustrative PE is shown in FIG. 9. PE 902 has eight inputs 904 and eight
outputs 906. Inputs 904 for PE 902 are outputs 114 from DDC 108 which show
where dependencies exist. For example, in the case of source register 1 (S1), I7S1 tag
assign PE 902's seven inputs are the seven outputs 114 of DDC 108 that indicate
whether I7S1 is dependent on I6D, whether I7S1 is dependent on I5D, and so on down
to whether I7S1 is dependent on I0D. An eighth input, shown at reference number
908, is always tied high because there should always be an output from PE 902.

As stated before, if an instruction depends on several previous instructions, PE
902 will select and output only the most recent previous instruction (in program
order) on which there is a dependency. This is accomplished by connecting the signal
showing if there is a dependency on the most recent previous instruction to the
highest priority input of PE 902, the signal showing if there is a dependency on the
second most recent previous instruction to the input of PE 902 with the second
highest priority, and so on for all previous instructions. The input of PE 902 with the
lowest priority is always tied high so that at least one of PE 902's outputs will be
asserted.
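
The behavior of the priority encoder and the MUX it drives can be sketched in software as follows. This is an illustration for instruction 7 only, under the assumption that the encoder output is one-hot; the function names and the string tags are hypothetical.

```python
def priority_encode(inputs):
    """One-hot output marking the first asserted input.

    inputs[0] has the highest priority. The last input is expected to be
    tied high so that exactly one output is always asserted.
    """
    out = [0] * len(inputs)
    for i, bit in enumerate(inputs):
        if bit:
            out[i] = 1
            break
    return out


def select_tag_for_i7(dep_on, producer_tags, own_tag):
    """dep_on[y] is 1 if the source of I7 depends on instruction y (0..6).

    The dependency on I6 drives the highest priority input, I0 the seventh,
    and the eighth input is tied high and selects the operand's own
    register-file address.
    """
    inputs = [dep_on[y] for y in range(6, -1, -1)] + [1]
    choices = [producer_tags[y] for y in range(6, -1, -1)] + [own_tag]
    one_hot = priority_encode(inputs)
    return choices[one_hot.index(1)]


slot_tags = [f"t{y}" for y in range(7)]
```

Because the scan stops at the first asserted input, a dependency on both I6 and I5 resolves to I6, and an operand with no dependencies falls through to the tied-high input.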

Outputs 906 are used as select lines for a MUX 910. MUX 910 has eight inputs
912 to which the tags for each instruction are applied.

To illustrate this, assume that I7 depends on I6 and I5. Then, since I6 has a higher
priority than I5, the bit corresponding to I6 at outputs 906 of PE 902 will be high. At
the corresponding input 912 of MUX 910 will be I6's tag for S1 (recall PE 902 is for
I7S1). Because I7 is dependent on I6, the location of I6's result must be output from
MUX 910 so that it can be used by I7. I6's tag will therefore be selected and output
on an output line 914. I6's done flag, DONE[6], must also be output from MUX 910
so that Issuer 118 will know when I7's input is ready. This data is passed to Issuer
118 via bus 120. Since an instruction can have up to three sources, TAL 122
monitors up to three dependencies for each instruction and sends three vectors for
each instruction (totalling 24 vectors) to Issuer 118. If an instruction is independent,
TAL 122 signals to Issuer 118 that the instruction can begin immediately.

The MSB of the tag outputs which are sent to RPMs 124 is used to indicate if the
address is a register file address or a temporary buffer address. If an instruction is
independent, then the five LSB outputs indicate the source register address. For
instructions that have dependencies: the second MSB indicates that the address is
for a 64 bit value; the third through fifth MSB outputs specify the temporary buffer
address; and the LSB output indicates which bucket is the current bucket, which is
equal to the BKT signal in TAL 122.
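
The 6-bit tag layout described above can be packed and unpacked as in the sketch below. The field widths follow the text; the function names and the dictionary returned by `decode_tag` are assumptions made for illustration.

```python
def register_tag(reg):
    """Tag for an independent operand: MSB = 0, five LSBs = register address."""
    if not 0 <= reg < 32:
        raise ValueError("register address out of range")
    return reg


def buffer_tag(is_64bit, buf_addr, bkt):
    """Tag for a dependent operand.

    Bit 5: 1 (temporary buffer), bit 4: 64-bit flag,
    bits 3..1: temporary buffer address, bit 0: current bucket (BKT).
    """
    if not (0 <= buf_addr < 8 and bkt in (0, 1)):
        raise ValueError("buffer address or bucket out of range")
    return (1 << 5) | (int(is_64bit) << 4) | (buf_addr << 1) | bkt


def decode_tag(tag):
    """Split a 6-bit tag back into its fields."""
    if (tag >> 5) & 1 == 0:
        return {"in_buffer": False, "reg": tag & 0x1F}
    return {
        "in_buffer": True,
        "is_64bit": bool((tag >> 4) & 1),
        "buf_addr": (tag >> 1) & 0x7,
        "bkt": tag & 1,
    }
```

A single bit thus tells the register file whether the remaining bits address a real register or a temporary buffer slot.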

Like DDC 108, TAL 122 has numerous implementation dependent special
cases that it handles. First, in an embodiment of the present invention, register
number 0 of the register file is always equal to 0. Therefore, even if one instruction
writes to register 0 and another reads from register 0, there will be no dependency
between them. TAL 122 receives three signals from Instruction Decode Logic (IDL;
not shown) for each instruction to indicate if one of that instruction's sources is
register 0. If any of those is asserted, TAL 122 will ignore any dependencies for that
particular input of that instruction.

Another special case occurs because under some circumstances, an instruction in
bucket 0 will be guaranteed to not have any of the instructions in bucket 1 dependent
on it. A four bit signal called BKT1_NODEP is sent to RRC 112 from the IEU
control logic (not shown) and if BKT1_NODEP[X] = 1 then RRC 112 knows to ignore
any dependencies between instructions 4, 5, 6 or 7 and instruction X.

An example for TAG assignment of instruction 7's source 1 (I7S1) is shown in a
flowchart in FIGs. 6A-6B. TAL 122 first determines whether I7S1 is register 0, as
shown at a conditional block 602. If the first source operand for I7 is register 0, the
TAG is set equal to zero, and I7S1's INFO flag is set equal to one, as shown in a
block 604. If the first source operand (S1) for I7 is not register 0, TAL 122 then
determines if I7S1 is dependent on I6S/D, as shown at a conditional block 606. If
I7S1 is dependent on I6S/D, flow then proceeds to a block 610 where I7S1's TAG is
set equal to {1,DBLREG[6],0,1,0,BKT} and I7S1's INFO flag is set equal to DONE[6].
If the condition tested at conditional block 606 is not met, flow proceeds to
conditional block 612 where TAL 122 determines if I7S1 is dependent on I5S/D. If
there is a dependency, flow then proceeds to block 616 where TAL 122 sets I7S1's
TAG equal to {1,DBLREG[5],0,0,1,BKT} and I7S1's INFO flag is set equal to
DONE[5]. If the condition tested at block 612 is not met, flow proceeds
to a block 618 where TAL 122 determines if I7S1 is dependent on I4S/D.

As evident by inspection of the remaining sections of FIGs. 6A and 6B, similar
TAG determinations are made depending on whether I7S1 is dependent on I4S/D,
I3S/D, I2S/D, I1S/D and I0S/D, as shown at sections 620, 622, 624, 626 and 628,
respectively. Finally, if instruction 7 is independent of instruction 0 or if all
instructions in bucket 1 are independent of instruction 0 (i.e., if BKT1_NODEP[0] =
1), as tested at a conditional block 630, the flow proceeds to block 632 where TAL
122 sets I7S1's TAG equal to {0,I7S1} and I7S1's INFO flag equal to 1. It should be
noted for the above example that I7S1 TAG signals are forwarded directly to the
register file port MUXes of register file 117. The I7S1 INFO signals are sent to
Issuer 118 to tell it when I7's S1 input is ready.

A representative block diagram of Issuer 118 is shown in FIG. 10. In a preferred
embodiment, Issuer 118 has one scanner block 1002 for each resource (functional
unit) that has to be allocated. In this example, Issuer 118 has scanner blocks FU1,
FU2, FU3, FU4 through FUn. Requests for functional units are generated from
instruction information by decoding logic (not shown) in a known manner, which are
sent to scanners 1002 via bus 123. Each scanner block 1002 scans from instruction
I0 to I7 and selects the first request for the corresponding functional unit to be
serviced during that cycle.

In the case of multiple register files (integer, floating and/or boolean), Issuer 118 is
capable of issuing instructions having operands stored in different register files. For
example, an ADD instruction may have a first operand from the floating point
register file and a second operand from the integer register file. Instructions with
operands from different register files are typically given higher issue priority (i.e.,
they are issued first). This issuing technique conserves processor execution time and
functional unit resources.

In a further embodiment in which IEU 100 may include two ALUs, ALU
scanning becomes a bit more complicated. For speed reasons, one ALU scanner
block scans from I0 to I7, while the other scanner block scans from I7 to I0. This is
how two ALU requests are selected. With this scheme it is possible that an ALU
instruction in bucket 1 will get issued before an ALU instruction in bucket 0, while
increasing scanning efficiency.
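
The scanning scheme can be sketched as below: one scanner per functional unit picks the first request in program order, and the second ALU scanner works from the opposite end. The text does not say what happens when both ALU scanners land on the same request, so the sketch assumes the second scanner then selects nothing. The function names are hypothetical.

```python
def scan(requests, reverse=False):
    """Return the index of the first asserted request, or None.

    requests[i] is truthy if instruction i requests this functional unit.
    With reverse=True the scan runs from the last instruction to the first.
    """
    order = range(len(requests) - 1, -1, -1) if reverse else range(len(requests))
    for i in order:
        if requests[i]:
            return i
    return None


def pick_two_alu(requests):
    """Select up to two ALU requests using two scanners in opposite directions."""
    first = scan(requests)
    second = scan(requests, reverse=True)
    if second == first:
        second = None  # assumption: a single request is issued only once
    return first, second
```

With requests pending for I1, I5 and I6, the forward scanner selects I1 and the reverse scanner selects I6, so I6 (bucket 1) can issue ahead of I5 and of any unselected bucket 0 instruction.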

Scanner outputs 1003 are selected by MUXing logic 1004. A set of SELect inputs
1006 for MUX 1004 receive three 8-bit vectors (one for each operand) from TAL 122
via bus 120. The vectors indicate which of the eight instructions have no
dependencies and are ready to be issued. Issuer 118 must wait for this information
before it can start to issue any instructions. Issuer 118 monitors these vectors and
when all three go high for a particular instruction, Issuer 118 knows that the inputs
for that instruction are ready. Once the necessary functional unit is ready, the issuer
can issue that instruction and send select signals to the register file port MUXes to
pass the corresponding instruction's outputs to register file 117.
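
The readiness test on the three 8-bit vectors amounts to a bitwise AND, as the sketch below shows. The mapping of bit i to instruction i and the function names are assumptions made for illustration.

```python
def ready_mask(s1_info, s2_info, sd_info):
    """Bit i is 1 only if all three operand vectors are high for instruction i."""
    return s1_info & s2_info & sd_info & 0xFF


def ready_instructions(s1_info, s2_info, sd_info):
    """List the instructions whose inputs are all ready."""
    mask = ready_mask(s1_info, s2_info, sd_info)
    return [i for i in range(8) if (mask >> i) & 1]


# Four-instruction example before any instruction is done:
# I1S2 waits on DONE[0]; I3S1 and I3S2 wait on DONE[2] and DONE[1].
s1 = 0b0111
s2 = 0b0101
sd = 0b1111
ready_now = ready_instructions(s1, s2, sd)
```

Only I0 and I2 are ready at that point, which agrees with the ready signals listed for the example; an instruction in this list still has to win its functional unit before it is issued.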

In a preferred embodiment of the present invention, after Issuer 118 is done it
provides two 8-bit vectors per register file back to RRC 112 via MUX outputs 1008
to bus 121. These vectors indicate which instructions are issued this cycle, and are
used as select lines for RPMs 124.

The maximum number of instructions that can be issued simultaneously for each
register file is restricted by the number of register file read ports available. A data
dependency with a previous uncompleted instruction may prevent an instruction
from being issued. In addition, an instruction may be prevented from being issued if
the necessary functional unit is allocated to another instruction.

Several instructions, such as load immediate instructions, Boolean operations and
relative conditional branches, may be issued independently, because they may not
require resources other than register file read ports or they may potentially have no
dependencies.

The last section of RRC 112 is the register file port MUX (RPM) section 124. The
function of RPMs 124 is to provide a way for Issuer 118 to get data out of register
files 117 for each instruction to use. RPMs 124 receive tag information via bus 126,
and the select lines for RPMs 124 come from Issuer 118 via a bus 121 and also from
the computer's IEU control logic. The selected TAGs comprise read addresses that
are sent to a predetermined set of ports 119 of register file 117 using bus 128.

The number and design of RPMs 124 depend on the number of register files and
the number of ports on each register file. One embodiment of RPMs 124 is shown in
FIG. 4. In this embodiment, RPMs 124 comprises 3 register file port MUXes 402,
404 and 406. MUX 402 receives as inputs the TAGs of instructions 0-7
corresponding to the source register field S1 that are generated by TAL 122. MUX
404 receives as inputs the TAGs of instructions 0-7 corresponding to the source
register field S2 that are generated by TAL 122. MUX 406 receives as inputs the
TAGs of instructions 0-7 corresponding to the source/destination register field S/D
that are generated by TAL 122. The outputs of MUXes 402, 404 and 406 are
connected to the read address ports of register file 117 via bus 128.

RRC 112 and Issuer 118 allow the processor to execute instructions
simultaneously and out of program order. An IEU for use with the present invention
is disclosed in commonly owned, co-pending application Serial No. 07/817,810
(Attorney Docket No. SP015/1397.0280001), the disclosure of which is incorporated
herein by reference.

While various embodiments of the present invention have been described above, it
should be understood that they have been presented by way of example, and not
limitation. Thus the breadth and scope of the present invention should not be limited
by any of the above-described exemplary embodiments, but should be defined only in
accordance with the following claims and their equivalents.

What is claimed is:



1. A register renaming system for out-of-order execution of a set of reduced
instruction set computer instructions having addressable source and
destination register fields, adapted for use in a computer having an instruction
execution unit with a register file accessed by read address ports and for
storing instruction operands, the system comprising:
    (a) data dependence check means for determining data dependencies
between the instructions;
    (b) tag assignment means for generating one or more tags to specify
the location of operands, based on said data dependencies determined by said
data dependence check means; and
    (c) register file port means for selecting said tags generated by said
tag assignment means and passing said tags onto the read address ports of
the register file for storing execution results.

2. The system of claim 1, wherein said data dependence check means determines
said data dependencies by comparing the addresses of the source register field
of each instruction to the addresses of the destination register fields.

3. The system of claim 1, further comprising temporary storage means for
temporarily storing out-of-order execution results, wherein said out-of-order
execution results are passed to the register files in order after execution of the
set of instructions is completed.

4. A register renaming method for performing out-of-order execution of a set of
reduced instruction set computer instructions having addressable source and
destination register fields, adapted for use in a computer having an instruction
execution unit with a register file accessed by read address ports and for
storing instruction operands, the method comprising the steps of:
    (1) determining data dependencies between the instructions;
    (2) performing at least one of in-order and out-of-order issuing of two
or more of the instructions in the instruction execution unit;
    (3) storing, temporarily, any out-of-order results in temporary
storage means;
    (4) generating one or more tags to specify the location of said
out-of-order results based on said data dependencies;
    (5) selecting appropriate ones of said tags corresponding to the
issued instruction; and
    (6) passing said selected tags onto the read address ports of the
register file for one of:
        (i) accessing said out-of-order results; and
        (ii) storing execution results.

5. The method of claim 4, wherein said determining step further comprises the step of
comparing the addresses of the source register field of each instruction to the
addresses of the destination register fields.

6. The method of claim 4, further comprising the step of storing in-order results in the
register file and the temporary storage means.

7. The method of claim 4, further comprising the step of passing said out-of-order
results to the register files in-order after execution of the set of instructions is
completed.

8. A method for issuing instructions in a superscalar reduced instruction set
computer having an instruction issuer system capable of issuing a plurality of
instructions in a single cycle, the system having more than one register file
sets, the method comprising the steps of:
    prioritizing instructions to be issued according to the number of
different register file sets furnishing operands to the instruction; and
    issuing instructions according to said prioritizing step.

9. The method according to claim 8, wherein said prioritizing step assigns a higher
priority to those instructions having operands furnished from the greater
number of register file sets.

10. An instruction issuer system in a superscalar reduced instruction set computer,
the system capable of issuing a plurality of instructions in a single cycle, the
system having more than one set of register files, the system comprising:
    first means for prioritizing instructions to be issued according to
the number of different register files furnishing operands to the instruction;
and
    second means, responsive to said first means, for issuing
instructions according to said priority.

11. The system according to claim 10, wherein said first means further assigns a
higher priority to those instructions having operands furnished from the greater
number of register files.